Abstract

This paper quantitatively explores the social and 
socio-semantic patterns of constitution of academic
collaboration teams. To this end, we broadly underline
two critical features of social networks of
knowledge-based collaboration: first, they essentially 
consist of group-level interactions which call for team- 
centered approaches. Formally, this induces the use 
of hypergraphs and n-adic interactions, rather than 
traditional dyadic frameworks of interaction such as 
graphs, binding only pairs of agents. Second, we ad- 
vocate the joint consideration of structural and se- 
mantic features, as collaborations are allegedly con-
strained by both of them. Considering these provi- 
sions, we propose a framework which principally en- 
ables us to empirically test a series of hypotheses re- 
lated to academic team formation patterns. In partic- 
ular, we exhibit and characterize the influence of an 
implicit group structure driving recurrent team forma- 
tion processes. On the whole, innovative production 
does not appear to be correlated with more original 
teams, while a polarization appears between groups 
composed of experts only or non-experts only, alto- 
gether corresponding to collectives with a high rate of 
repeated interactions. 



1 Introduction 

The mechanisms of academic collaboration are the 
focus of a long and established tradition of research 
(Katz & Martin, 1997), from qualitative studies on
cooperation and co-optation behaviors (Crane, 1969; 
Chubin, 1976; Latour & Woolgar, 1979) to more quan- 
titative approaches (deB. Beaver & Rosen, 1978-1979;
deB. Beaver, 1986; Melin & Persson, 1996). The lat-
ter includes network-based studies, which are gener- 
ally aiming at understanding the structural determi- 
nants and patterns of collaboration (Mullins, 1972;
Newman, 2001; Barabasi et al., 2002; Moody, 2004; 
Wagner & Leydesdorff, 2005; Leahey & Reikowsky,
2008). In this case, the quantitative formal framework 
of choice is the social network of dyadic interactions, 
addressing questions related to how ego-centered char- 
acteristics, in the broad sense, influence the likelihood 
of being involved in a collaboration. 

The Team Level and Networks 

Network studies, specifically in the context of scientific 
collaboration, indeed often focus on the level of the in- 
dividual in spite of a large amount of work on the ques- 
tion of group cohesiveness (Lott & Lott, 1965; Bollen
& Hoyle, 1990; Friedkin, 2004). There are wider im- 
plications of this focus on the ego-centered level: 

• By aiming at describing individual behavioral 
patterns, this perspective may overlook the influ-
ence of characteristics expressible at the meso-
level of the team itself. In particular, by focusing
on dyadic interactions and relational patterns be- 
tween ego and alter(s), the presence of ego in a
given collaboration is interpreted as a function
of the characteristics of ego and those of alter(s),
and of the characteristics of the various dyads be-
tween ego and alter(s).

• Further, the creation of a group results from a 
complex agreement and arrangement between all 
its members, who jointly decide to collaborate. 
As such, even when assuming that the behav- 
ior of ego may depend on non-dyadic, team-level 
characteristics, interpreting team formation pro- 
cesses as a sum of individual rationalities may 
oftentimes seem difficult, or irrelevant. Put dif- 
ferently, there are regularities in team formation 
processes which are difficult to ascribe specifically
back to individuals; it may appear more natural 
and consistent to appraise the underpinnings of
group formation at the group level.^1

To sum up, when dyadic frameworks are involved, 
collaboration teams are appraised under the lens of 
multiple one-to-one interactions. It should be no sur- 
prise: social network literature is itself overwhelm- 
ingly concerned with dyadic links. However, a size- 
able portion of sociology, starting with Simmel (1898), 
has long been concerned with wider frameworks of in-
teractions, or so-called "social circles", which some 
authors have formalized to take directly into account 
non-dyadic relationships: Breiger (1974, 1990), for in- 
stance, proposed to use bipartite graphs to represent 
and analyze ties between actors and social groups. Fo- 
cusing on the group-level, Ruef (2002) quantitatively 
examined the contribution of several factors including 
gender, status, or ethnicity, in the preferential consti- 
tution of business founding teams. In a review study,
Freeman (2003) explored various approaches previ- 
ously adopted in mathematical sociology to model 
two-mode data in order to account for the presence 
of subsets of people participating altogether in (sub- 
sets of) identical events. 

In this respect, it therefore first appears that aca- 
demic collaboration choices and dynamics should be 
characterized by investigating the meso-level of team 
formation. More precisely, it should be fruitful to fo- 
cus on teams rather than pairs of agents interacting 
together, thus advocating the use of hypergraphs or 
bipartite graphs rather than traditional frameworks 
based on graphs. Hypergraphs indeed feature hy- 
perlinks which connect arbitrary numbers of agents, 
while graphs feature links which connect only pairs of 
agents. In other words, considering hypergraphs pre- 
vents making the superfluous and plausibly debatable 
assumption that teams are equivalent to complete sub- 
graphs featuring one-to-one interactions between all 
their members (i.e. assuming for instance that a triad is
equivalent to three dyads). 
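The loss of information induced by the dyadic assumption can be illustrated with a minimal Python sketch. The author labels `a`, `b`, `c` and the helper `clique_projection` are hypothetical and introduced only for illustration: a single three-author paper and three separate two-author papers yield the same graph, although they are distinct hypergraphs.

```python
from itertools import combinations

def clique_projection(hyperlinks):
    """Replace each team by the set of all unordered pairs of its members."""
    pairs = set()
    for team in hyperlinks:
        pairs.update(frozenset(p) for p in combinations(sorted(team), 2))
    return pairs

# History 1: one three-author paper (a single triadic hyperlink).
one_triad = [frozenset({"a", "b", "c"})]
# History 2: three two-author papers (three dyadic hyperlinks).
three_dyads = [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"a", "c"})]

# Both histories collapse onto the same dyadic graph...
same_graph = clique_projection(one_triad) == clique_projection(three_dyads)
# ...while they remain distinct as hypergraphs.
same_hypergraph = set(one_triad) == set(three_dyads)
```

Here `same_graph` evaluates to `True` and `same_hypergraph` to `False`, which is the sense in which a triad is not equivalent to three dyads.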

Hybrid Networks of Actors and Concepts 

Secondly, collaboration massively depends on cogni- 
tive properties, in particular some cognitive fit be- 
tween team members, as agents plausibly compose 
teams in order to gather complementary competences. 
For instance, some economic models of knowledge cre- 
ation consider matching rules based on the similar- 
ity of agent profiles, as elements of a vector space, 
to explain economic network structure (Cowan et al., 
2002). In other words, equal attention should be given 

^1 Note that what we call a "team" here actually relates to a
group that is involved in the production of an academic paper, 
i.e. the team of coauthors that produces it; it does not corre- 
spond to the more or less explicit notion of team that may exist 
in some research labs. 



to social and semantic features, which are tradition- 
ally left apart in the literature, although the existence 
of homophily-driven interactions has been underlined 
in numerous works (McPherson et al., 2001). 

Our main hypothesis is that one cannot correctly 
understand the underlying social processes if both so- 
cial and semantic dimensions of, e.g., scientific activ- 
ity, are not considered as two interdependent dynam- 
ics (Roth, 2006; Roth & Cointet, 2010). Going further, 
we construe scientific dynamics as made of groupings 
of both agents and concepts: the epistemic dynam- 
ics, i.e. the collective scientific knowledge construc- 
tion, is made of events which simultaneously involve 
compounds of actors and concepts. In line with the 
program introduced by Callon (1986), we will appraise
scientific dynamics as made of constant reconfigura- 
tion and re-negotiation of collectives of both humans 
and non-humans. 

In this respect and more broadly, in addition to fo- 
cusing on teams, we thus advocate the enrichment 
of the notion of team by considering teams as joint 
groupings of both agents and semantic items. 

Knowledge-based teamwork 

The interest in the social epistemology of academic 
communities also has a broader reach. As a knowledge 
production arena, science is indeed likely to share fea- 
tures found in other collaborative knowledge creation 
contexts. 

(i) Collaboration in knowledge production systems. 

This issue may shed light, to some extent, on the 
interaction processes underlying, broadly, col- 
laborative knowledge production. These con- 
texts indeed define a particularly common class 
of social networks of collaboration, where agents 
jointly and collectively interact for purposes 
of knowledge production, in the broad sense. 
This encompasses activist groups and politi- 
cal epistemic communities (Ruggie, 1975; Haas, 
1992), scientific communities (deB. Beaver &
Rosen, 1978-1979; Laband & Tollison, 2000;
Jones et al., 2008; Stokols et al., 2008; Lea- 
hey & Reikowsky, 2008) and more specifi- 
cally research projects (Laredo, 1995, 1998), 
open-source development communities (Kogut & 
Metiu, 2001) and discussion lists and forums 
(Constant et al., 1996; Welser et al., 2007), wiki
platform-mediated communities (Bryant et al., 
2005; Levrel, 2006), artists gathering for a the- 
ater performance (Uzzi & Spiro, 2005) or making 
a movie (Faulkner & Anderson, 1987; Ramasco 
et al., 2004), board members making collective 
decisions (Davis & Greve, 1996). 

(ii) Collaboration in teams. 






This kind of relatively autonomous collabora- 
tion mode has to be understood in a con- 
text where traditionally vertical and hierarchi- 
cal organizations have recently been functioning 
in increasingly horizontal and networked ways 
(Powell, 1990; Miles & Snow, 1996; Smangs, 
2006). This contemporary so-called "network 
governance" involves dynamic coalitions of ac- 
tors both at organizational and individual levels, 
increase of teamwork and frequent group recon- 
figurations (Jones et al., 1997). This shift is particularly
noticeable in contexts where agents are
relatively free to group to form casual alliances 
and where collaboration sometimes appears to 
be self-organized. 

In this respect, science appears to be a prototypical 
case of such teamwork-based systems (deB. Beaver, 
1986; Adams et al., 2005; Wuchty et al., 2007):
scientific knowledge production essentially involves 
events where researchers jointly work to manipulate 
and introduce concepts. It is additionally one of the 
most accomplished contexts of knowledge-based collab-
oration, as well as one of the most explicit, by its very
stigmergic^2 nature: papers indeed constitute a con-
crete, often public instance of these gatherings and 
therefore provide an opportunity to understand the 
impact of these collaborations on the dynamics of sci- 
ence. On the empirical side, we thus rely on large 
bibliographic databases. 

As such, our approach does not claim to embrace
the whole complexity of knowledge-intensive or-
ganizations, in particular the intricate co-evolutionary 
processes existing between formal organizations and 
more local team-based and individual-based decisions 
(Lazega et al., 2008). However, the methodology we
propose is able to shed some original light on portions 
of the dynamics of these knowledge production sys- 
tems. 

The paper is organized as follows: in Sec. 2, we 
present the framework and support several hypothe- 
ses on socio-semantic team-based collaboration. Sec. 3 
introduces the protocol and methods, while Sec. 4 
presents the results, which we then discuss in light
of the initially proposed hypotheses. 

2 Framework 

As follows from the introduction, we hence argue that 
two features are key in extending the understand-
ing of, on the one hand, collaboration networks, and on the

^2 "Stigmergic": that is, leaving traces susceptible to guide
the work of others. For an extensive discussion of this notion,
see Karsai & Penzes (1993).



other hand and additionally, knowledge production 
networks: 

1. Group effects underlie and partially determine 
dyadic interactions: affiliation to teams of collab- 
oration, membership in identical epistemic com- 
munities, for instance, structure and influence the 
very formation of these interactions. 

2. In the case of social networks of knowledge, these 
underlying groups are both social (work commu- 
nities) and semantic (epistemic communities). In 
particular, the choice of collaboration partners is 
likely to highly depend on cognitive similarity. 

More to the point, in terms of strictly social and 
strictly semantic associations, we first aim at checking 
the following simple hypotheses, by comparing what 
happens empirically with what would have happened 
if teams had been formed strictly by chance (i.e. by 
comparing empirical teams with a null-model featur- 
ing random compositions of teams). 

(H1). Teams with a high rate of interaction rep-
etition should be more likely, as could be
expected because of social cohesion (Bollen
& Hoyle, 1990; McPherson & Smith-Lovin,
2002; Friedkin, 2004) or organizational con-
straints (Rodriguez & Pepe, 2008).

(H2). Teams where a high proportion of concepts 
are repeatedly associated should be more 
likely, as assumed by co-word analysis
(Callon et al., 1986; Noyons & van Raan,
1998), where frequent associations of terms 
are supposed to define conceptual cores and 
field boundaries. 

(H3). Papers with a higher semantic originality 
(i.e. new associations of concepts) should be
those where there is a higher number of new
interactions.^3 Put differently, as suggested
by social and semantic repetitions assumed
by H1 and H2, teams with a high number of
repeated interactions should tend to produce
papers that have smaller semantic/topical
originality, and which in some sense belong to
a narrower subfield of research (Leahey &
Reikowsky, 2008).

Then, we appraise the socio-semantic composition 
of teams. We more precisely focus on the distinction 

^3 As Callon (1994, p. 414) sums up from the existing litera-
ture, 

"The more numerous and different these heteroge- 
neous collectives are, the more the reconfigurations 
produced are themselves varied"
between agents who are already familiar with some
concepts involved in the interaction, and those who 
are not. This approach will more broadly inform us 
about the cognitive specialization of teams. 

(HI). Because of both scientific specialization 
(Chubin, 1976) and homophily (McPherson
et al., 2001; Stokols et al., 2008), teams
gathering around a given topic should gen- 
erally involve more individuals knowledge- 
able about this given topic. 

(HII). Teams with a balanced composition of ex- 
perts in a given field should produce more 
innovation (Ancona & Caldwell, 1992), 
which in terms of networks could be trans- 
lated into: 

• more semantic originality, i.e. novel as- 
sociations of concepts, 

• more social originality, i.e. novel inter- 
actions between agents. 

3 Protocol and methods 

In line with this focus on socio-semantic aspects, we 
will thus endeavor to exhibit how new teams are
formed by considering both social and conceptual past 
acquaintances of scientists involved in new collabora- 
tions. We will concretely describe the semantic dimen- 
sion in terms of attributes qualifying topics of interest 
of authors and the social dimension as structural and 
relational properties in the dynamic collaboration net- 
work — which altogether will enable us to confirm or 
refute the previous set of hypotheses. 

3.1 Datasets 

Our empirical analysis focuses on collaboration 
databases, which reveal a large part of the underlying 
collaboration activity, including social links between 
individuals or conceptual acquaintances of each indi- 
vidual (i.e. details regarding which topics which agents 
are familiar with). These datasets provide temporal 
information on teams, gathering agents and the top- 
ics they work on, assuming that topics are described 
by the very terms used in paper abstracts. For each
dataset, we focus on a set of no more than a hundred
relevant terms. These terms are selected with the
help of an expert of the corresponding field and are 
such that they appropriately cover the most signifi- 
cant topics of each field. 

We use the following datasets, defined either from a 
semantic perspective (using e.g. field names) or from 
a social perspective (using e.g. scientific assemblies), 
and involving both large and small communities: 



1. Embryologists working within a given and well-
determined subfield, the zebrafish, over a pe-
riod of 20 years (1985-2004). Data was ex-
tracted from the publicly available database Med-
Line, which eventually yields a dataset of 6 145
articles (13 084 authors, 71 word classes).

2. Scientists working on rabies from the same kind of 
MedLine extraction as for zebrafish embryologists 
— the observed period spans from 1985 to 2007. 
This ends up with 4 648 events (9 684 authors, 70 
word classes). 

3. Scientific committee members for JEMRA meet-
ings^4: this dataset includes the publications of
an initial set of 168 scientists involved in these
meetings, gathered from 1985 to 2007. This ends
up with 5 893 papers (15 375 authors, 69 word
classes).

4. Scientific committee members for JECFA meet-
ings^5: similarly, publications of an initial set of
178 scientists are gathered from 1985 to 2007. 
This ends up with 8 685 papers (21 195 authors, 
85 word classes). 

3.2 Hypergraph-based definitions 

Now, these agents and concepts formally define an 
evolving hypergraph where each article is a hybrid hy- 
perlink gathering both authors and the topics involved 
in the collaboration, as partly exemplified by Fig. ??. 

In what follows, we describe comprehensively our 
formal framework (Sec. 3.2.1), which, basically, allows 
us to gather both agents and concepts in a dynamic 
setting and to define which agents are new, or not 
(newcomers vs. veterans), which concepts are new, or
not (novelties vs. standards), and which agents have
used which concepts in the past, or not (neophytes vs.
experts).

Building upon these definitions, we will then pro-
pose a series of hypergraphic measures (Sec. 3.2.2),
that is, measures at the level of teams, or non-dyadic
measures, which cover the proportion of experts in
a given collaboration (expertise ratio) and the origi-
nality of participants in a team (hypergraphic repeti-
tion, i.e. describing to what extent a team does gather
agents, or concepts, which were jointly associated, at
the team level, in previous periods). For instance, a
team with an expertise ratio of one will be such that 
all agents are experts; a team with a hypergraphic rep- 
etition of one, in terms of agents, will be such that all 

^4 Joint FAO/WHO Expert Meetings on Microbiological Risk Assessment,
http://www.fao.org/ag/agn/agns/jemra_index_en.asp

^5 Joint FAO/WHO Expert Committee on Food Additives,
http://www.who.int/ipcs/food/jecfa
its agents will have altogether previously collaborated
(it is zero in case none of the agents have previously 
been associated). 

Then, we present a methodology (Sec. 3.2.3) for 
computing how much the empirical data diverges from 
a random setting with a comparison between the ac- 
tual observed data and a uniform null-model of hy- 
pergraph evolution. Put simply, we will appraise how 
much teams with, e.g., a given hypergraphic repeti- 
tion ratio, are forming significantly more often than 
could be expected by chance. This latter tool will be 
the cornerstone of the empirical testing of hypotheses 
1-2-3 & I-II. 

3.2.1 Objects 
Hypergraphs. 

Formally, a hypergraph features nodes and hyperlinks, 
which describe n-adic interactions among any subset 
of nodes. It is therefore a generalization of the notion 
of graph whose links only describe dyadic interactions, 
i.e. between pairs of nodes. As such, any hyperlink 
corresponds to any grouping of agents and any kind 
of social circle: it may describe social events, organi- 
zations, families, teams, etc. A hypergraph is also iso- 
morphic to a bipartite graph, where agents on one side 
are connected to various affiliations, groups or events
on the other side; as such, it is a structure which reifies
the duality of social groups (Breiger, 1974; Freeman, 
2003). See Fig. ??. 

Beyond the simple observation of the structure of 
such networks, several studies have endeavored at re- 
constructing structural properties typically induced 
by the hypergraphic setting — namely, that agents 
interact within groups of some sort — rather than 
using dyadic interactions only: in this direction New- 
man et al. (2001); Ramasco et al. (2004); Guimera 
et al. (2005), inter alia, examine the structure of a 
social network whose dyadic links stem from teams — 
team composition is first empirically appraised then 
stylized and used as a basis for what essentially is 
a clique addition process. In these models however 
the focus remains on dyadic relationships or dyadic 
interaction behaviors, rather than truly hypergraphic 
measures. 

In contrast, the focal level of analysis of the present 
study is the hypergraph and its hyperlinks. 

Epistemic hypergraphs. 

To bind the social and semantic aspects, we introduce
the notion of epistemic hypergraph $\mathfrak{E}$ using:

(i) a set of agents $\mathcal{A}$,

(ii) a set of concepts $\mathcal{C}$, and

(iii) the epistemic hypergraph itself $\mathfrak{E} \subseteq \mathcal{P}(\mathcal{A} \cup \mathcal{C})$,
describing the joint appearance of agents and
concepts, and hence the usage of the lat-
ter by the former, where each collaboration is a
hyperlink $e \in \mathcal{P}(\mathcal{A} \cup \mathcal{C})$.

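As such, an "epistemic hypergraph" is properly de-
fined by a triple $(\mathcal{A}, \mathcal{C}, \mathfrak{E})$. Dynamic epistemic hyper-
graphs are indexed with time, $\mathfrak{E}_t$, and are considered
to be growing: $t < t' \Rightarrow \mathfrak{E}_t \subseteq \mathfrak{E}_{t'}$.

At each timestep, new teams are formed and thus
hyperlinks appear; we denote this set by $\Delta\mathfrak{E}_t$, such
that $\mathfrak{E}_t = \mathfrak{E}_{t-1} \cup \Delta\mathfrak{E}_t$. Note that $\Delta\mathfrak{E}_t$ is not necessarily
equal to $\mathfrak{E}_t \setminus \mathfrak{E}_{t-1}$ since some teams forming at $t$ may
already have appeared in $\mathfrak{E}_{t-1}$.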

See an illustration of this framework on Fig. ??. 

We also define a projection operation for hyperlinks:
given a hyperlink $e \in \mathfrak{E}_t$ and a subset $E \subseteq \mathcal{A} \cup \mathcal{C}$, the
projection of $e$ on the set $E$ is noted $e^E = e \cap E$. For
instance, the fact that all hyperlinks contain at least
one agent translates as $\forall e,\ e^{\mathcal{A}} \neq \emptyset$.

We can thus define a (dynamic) collaboration hy-
pergraph $\{e^{\mathcal{A}} \mid e \in \mathfrak{E}_t\} = \mathfrak{A}_t \subseteq \mathcal{P}(\mathcal{A})$ whose hyperlinks
connect team members, and a semantic hypergraph
$\{e^{\mathcal{C}} \mid e \in \mathfrak{E}_t\} = \mathfrak{C}_t \subseteq \mathcal{P}(\mathcal{C})$ whose hyperlinks are sets
of concepts mentioned in a given collaboration. In
particular, $\mathfrak{A}_t$ is isomorphic to a bipartite graph of
collaboration, traditional in the literature (Newman
et al., 2001; Guimera et al., 2005).
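These objects can be sketched in a few lines of Python, assuming hyperlinks are stored as frozensets mixing agent and concept labels. The sets `AGENTS` and `CONCEPTS`, the labels they contain and the helper `project` are hypothetical illustrations, not data from the corpora studied here.

```python
AGENTS = {"a1", "a2", "a3"}        # hypothetical set of agents
CONCEPTS = {"ants", "stigmergy"}   # hypothetical set of concepts

def project(hyperlink, subset):
    """Projection of a hyperlink on a subset: plain set intersection."""
    return hyperlink & subset

# Epistemic hypergraph at t-1: one paper by a1 and a2 about "ants".
E_prev = {frozenset({"a1", "a2", "ants"})}

# Teams forming at t: one exact repetition, one genuinely new hyperlink.
delta = {
    frozenset({"a1", "a2", "ants"}),
    frozenset({"a2", "a3", "ants", "stigmergy"}),
}

# Growth rule: the hypergraph at t is the union of the past and the new teams.
E_now = E_prev | delta

# The set difference is smaller than delta, since one team already existed.
truly_new = E_now - E_prev

# Collaboration and semantic hypergraphs obtained by projection.
collaboration = {project(e, AGENTS) for e in E_now}
semantic = {project(e, CONCEPTS) for e in E_now}
```

In this toy case `truly_new` contains a single hyperlink whereas `delta` contains two, which illustrates why the set of teams forming at a given time need not coincide with the set difference between two successive hypergraphs.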

Neophytes and newcomers. 

We say that an agent $a$ is, at $t$, a "neophyte in a given
concept $c \in \mathcal{C}$" if s/he has never used $c$ before $t$: formally,
if $\nexists e \in \mathfrak{E}_{t-1}, \{a, c\} \subseteq e$. Otherwise, s/he is called an
"expert".

We say that an agent $a$ is a "newcomer" if s/he has
never published before $t$, which is equivalent to saying
that $\nexists e \in \mathfrak{E}_{t-1}, a \in e$. Otherwise, s/he is called a
"veteran".

Similarly, we say that a concept $c$ is a "novelty"
at $t$ if all agents are neophytes in this concept: $\nexists e \in
\mathfrak{E}_{t-1}, c \in e$. Otherwise, it is a "standard".
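The three distinctions above reduce to simple membership tests on the past hypergraph. The following Python sketch assumes past hyperlinks are frozensets of agent and concept labels; the function names and the toy data are hypothetical.

```python
def is_expert(agent, concept, past):
    """True if the agent already appeared with the concept in a past hyperlink."""
    return any(agent in e and concept in e for e in past)

def is_veteran(agent, past):
    """True if the agent already appears in some past hyperlink."""
    return any(agent in e for e in past)

def is_standard(concept, past):
    """True if the concept was already used in some past hyperlink."""
    return any(concept in e for e in past)

# Hypothetical past hypergraph: a single paper by a1 and a2 about "ants".
past = {frozenset({"a1", "a2", "ants"})}

a1_expert_in_ants = is_expert("a1", "ants", past)      # expert
a3_expert_in_ants = is_expert("a3", "ants", past)      # neophyte
a3_is_veteran = is_veteran("a3", past)                 # newcomer
stigmergy_is_standard = is_standard("stigmergy", past) # novelty
```

Neophytes, newcomers and novelties are simply the negations of these three predicates.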

3.2.2 Measures 

Homogeneity of teams and expertise ratio. 

Given these basic concepts, we may first examine the 
composition of teams using a simple hypergraphic 
measure pertaining to the composition of teams in 
terms of a simple proportion of experts: "how much 
are teams made of people familiar or not with a given 
concept which is used by the team?".

We call this proportion expertise ratio, noted $\xi$;
for example, a paper on "ants" where half of the au-
thors already worked on ants has a ratio of expertise
in "ants" of .5. Formally, the expertise ratio $\xi_{c,t}(e)$ in
concept $c \in e^{\mathcal{C}}$ at time $t$ of team $e$ is given by:

$$\xi_{c,t}(e) = \frac{|\{a \in e^{\mathcal{A}} \mid a \text{ is an expert in } c\}|}{|\{a \in e^{\mathcal{A}}\}|}$$

This notion, derived from the composition of a given 
team in terms of experts vs. neophytes in a given con- 
cept, expresses the socio-conceptual homogeneity of a 
team. See Fig. ??. 
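As a minimal sketch of this measure, the "ants" example can be reproduced in Python. The function name `expertise_ratio`, the author labels and the toy past hypergraph are assumptions made for illustration only.

```python
from fractions import Fraction

def expertise_ratio(team_agents, concept, past):
    """Proportion of team members who already used the concept in the past."""
    experts = [
        a for a in team_agents
        if any(a in e and concept in e for e in past)
    ]
    return Fraction(len(experts), len(team_agents))

# Hypothetical past: a1 and a2 already published together on "ants".
past = {frozenset({"a1", "a2", "ants"})}

# New four-author paper on "ants": two experts, two neophytes.
team = {"a1", "a2", "a3", "a4"}
ratio = expertise_ratio(team, "ants", past)
```

`Fraction` is used here only to keep the proportions exact, which is convenient when teams are later grouped by identical values of the ratio.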

Hypergraphic repetition. 

We may also express the degree of originality of the 
composition of a team and its subsequent groupings 
by measuring, in the broad sense, the proportion of 
already-existing associations of items, be it agents or 
concepts. More to the point, we may talk of social 
originality by describing the rate of new associations 
of agents in a given team; or, dually, we will denote 
conceptual originality by describing the proportion of 
new associations of concepts in a paper.^6

More precisely, in the dyadic case, an interaction is 
said to be repeated if the two nodes already jointly 
appeared in a previous collaboration. We extend this 
notion to the hypergraphic case: 

• We first say that a set of nodes has "previously
co-occurred" if there is at least one previously-
existing (< t) hyperlink including this set. We
define the corresponding function $p_t$ as follows:

$$p_t(e) = \begin{cases} 1 & \text{if } \exists e' \in \mathfrak{E}_{t-1},\ e \subseteq e' \\ 0 & \text{otherwise.} \end{cases}$$

Thus, for instance, if $a$ and $a'$ never collaborated
before $t$, we have $p_t(\{a, a'\}) = 0$.

• The notion of hypergraphic repetition is prop- 
erly defined for veteran agents and/or standard 
concepts — by definition, repetition cannot oc- 
cur with newcomers or novelties. 

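Therefore, in the following formulas, hyperlinks $e$
must be such that $\forall \epsilon \in e, \exists e' \in \mathfrak{E}_{t-1}$ such that
$\epsilon \in e'$. In other words, we ensure the use of such
hyperlinks by considering, $\forall e \in \mathfrak{E}_t$, truncated
hyperlinks $\tilde{e}$ restricted to the set of previously-
existing nodes, i.e.:

$$\tilde{e} = e \cap \bigcup_{e' \in \mathfrak{E}_{t-1}} e'$$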

We then compute the hypergraphic rate of rep-
etition for a hyperlink $e \in \mathfrak{E}_t$ as the proportion



^6 In which case, new concept associations are new with re-
spect to the whole system, consistently with the social case: 
i.e. this refers to concept associations which never existed in 
any paper of the preceding periods. 
Figure 1: Empirical distribution of the hypergraphic
repetition rate for concepts, $r_t(e^{\mathcal{C}})$.



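of subsets of this hyperlink that have previously
co-occurred:

$$r_t(e) = \frac{\sum_{e' \subseteq \tilde{e},\ |e'| \geq 2} p_t(e')}{\sum_{e' \subseteq \tilde{e},\ |e'| \geq 2} 1}$$

where $\tilde{e}$ denotes the truncated hyperlink.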



Depending on the objectives, it might be appro- 
priate to weight the relative importance of each 
subset of hyperlink e in the sums, for instance ac- 
cording to their size: for a discussion on weighting 
functions, see Appendix A. 

Let us consider the following example: given a new
collaboration $e$ forming at $t$, $r_t(e^{\mathcal{C}})$ thus measures its
hypergraphic concept repetition, i.e. how much the
concepts of $e^{\mathcal{C}}$ have been jointly associated, altogether,
in previous periods. Eventually, we may plot the dis-
tribution of such values $r_t$ for all teams, as shown in
Fig. 1. Put simply, it shows that about a third of
teams have a hypergraphic conceptual repetition of 1,
i.e. all their concepts $e^{\mathcal{C}}$ have already jointly been used
in the past.
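The hypergraphic rate of repetition can be sketched as follows in Python, under the assumption of uniform weights over subsets of size at least two (other weighting schemes are left aside). The function names and the toy past hypergraph are hypothetical; the same code applies to agent projections and to concept projections.

```python
from itertools import combinations

def co_occurred(subset, past):
    """1 if some past hyperlink contains the whole subset, 0 otherwise."""
    return int(any(subset <= e for e in past))

def repetition_rate(nodes, past):
    """Proportion of subsets (size >= 2) of the truncated hyperlink that
    previously co-occurred; None when no such subset exists."""
    known = set().union(*past)               # all previously-existing nodes
    truncated = sorted(set(nodes) & known)   # drop newcomers or novelties
    subsets = [
        frozenset(s)
        for k in range(2, len(truncated) + 1)
        for s in combinations(truncated, k)
    ]
    if not subsets:
        return None
    return sum(co_occurred(s, past) for s in subsets) / len(subsets)

# Hypothetical past collaboration hypergraph.
past = [frozenset({"a", "b", "c"}), frozenset({"c", "d"})]

full_repeat = repetition_rate({"a", "b", "c"}, past)  # whole team seen before
partial = repetition_rate({"a", "b", "d"}, past)      # only the pair a-b seen
no_repeat = repetition_rate({"a", "d", "z"}, past)    # z is new and is dropped
```

The number of subsets grows exponentially with the size of the hyperlink, so this direct enumeration is only practical for small teams.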

3.2.3 Estimating propensities of team forma- 
tion 

Null-model of hypergraph. 

A null-model of new teams based on agents (resp.
concepts) is defined such that, at each period t, we 
randomly create new teams respecting empirically- 
observed numbers of agents (resp. concepts) and their 
respective numbers of team participations. What is 
fundamentally randomized is the exact composition of 
teams in terms of who is collaborating with whom: in 
our null-model, team members are basically reshuffled.
Put differently, the null-model expresses the composi- 
tion of teams as would be happening by chance. 

In other words and more practically, 

• we empirically measure:

1. the size of new teams appearing at $t$, i.e. the
distribution of $|e^{\mathcal{A}}|$ (resp. $|e^{\mathcal{C}}|$) for $e \in \Delta\mathfrak{E}_t$,

2. for every element $\epsilon \in \mathcal{A}$ (resp. $\epsilon \in \mathcal{C}$), the
number of times it appears in newly-formed
teams, i.e.:

$|\{e \in \Delta\mathfrak{E}_t \text{ such that } \epsilon \in e\}|$

• we then generate an artificial, uniformly random
set of new teams $\widehat{\Delta\mathfrak{E}}_t \subseteq \mathcal{P}(\mathcal{A} \cup \mathcal{C})$ which respects
the above-mentioned distributions, that is:

1. same distribution of sizes of new hyperlinks,

2. same distribution of participations of ele-
ments in these new hyperlinks.

In the remainder, we examine and compare the
properties of the empirical $\Delta\mathfrak{E}_t$ and the randomly-
created $\widehat{\Delta\mathfrak{E}}_t$.
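One possible way to generate such a reshuffled set of teams is sketched below in Python. It is an assumption-laden illustration rather than the procedure actually used for the simulations: memberships are shuffled and re-cut into teams of the observed sizes, and a draw is rejected whenever an element would appear twice in the same team. This simple rejection scheme is not guaranteed to be exactly uniform over all admissible hypergraphs. The function name `reshuffle_teams` and the toy data are hypothetical.

```python
import random
from collections import Counter

def reshuffle_teams(new_teams, rng, max_tries=10_000):
    """Randomly reassign members to teams, preserving team sizes and the
    number of participations of each element."""
    sizes = [len(team) for team in new_teams]
    # One "stub" per participation of an element in a team.
    stubs = [x for team in new_teams for x in team]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        teams, start = [], 0
        for size in sizes:
            teams.append(stubs[start:start + size])
            start += size
        # Reject draws where an element appears twice in the same team.
        if all(len(set(team)) == len(team) for team in teams):
            return [frozenset(team) for team in teams]
    raise RuntimeError("no admissible reshuffling found")

# Hypothetical observed new teams (agent projections only).
observed = [frozenset({"a", "b", "c"}), frozenset({"a", "d"}), frozenset({"b", "d"})]
random_teams = reshuffle_teams(observed, random.Random(0))
```

Because the observed configuration is itself admissible, at least one valid reshuffling always exists; for large and heterogeneous datasets, an approach based on successive membership swaps may be preferable to plain rejection.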

Propensity. 

In particular, we define the propensity of team forma-
tion with respect to a given function $f$ of a hyperlink
(e.g. the hypergraphic rate of repetition) as, for each
possible value $x$ of the function, the ratio between the
observed number of new hyperlinks (events) $e$ such
that $f(e) = x$ and the randomly-created number of
such events:

$$\Pi_t(x) = \frac{|\{e \in \Delta\mathfrak{E}_t \text{ such that } f(e) = x\}|}{|\{e \in \widehat{\Delta\mathfrak{E}}_t \text{ such that } f(e) = x\}|} \qquad (1)$$

Obviously, if this quantity is above 1 for a certain
value of $x$, we say that this type of team empirically
occurs more than expected; otherwise, less.
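Computing a propensity then amounts to a ratio of counts, as in the following Python sketch. The function name `propensity` and the toy values are hypothetical; values of the function that never occur in the random set are skipped here, since the ratio is undefined for them, and this is a choice of the sketch rather than a prescription.

```python
from collections import Counter

def propensity(observed_values, random_values):
    """Ratio, for each value x, of the observed to the random number of teams."""
    observed = Counter(observed_values)
    simulated = Counter(random_values)
    return {
        x: observed[x] / simulated[x]
        for x in observed
        if simulated[x] > 0   # undefined when x never occurs at random
    }

# Hypothetical values of a team-level function (e.g. a repetition rate).
observed_values = [1.0, 1.0, 1.0, 0.0, 0.5]
random_values = [0.0, 0.0, 0.0, 0.5, 1.0]
result = propensity(observed_values, random_values)
```

In this toy case, teams with a value of 1.0 are three times more frequent than in the random set, while teams with a value of 0.0 are three times less frequent. When several random instances are simulated, the random counts would naturally be averaged over instances before taking the ratio.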

4 Results 

We may now empirically appraise hypotheses 1-2-3 & 
I-II. 

4.1 Simulation of the null-model 

We start by measuring the propensity of team for- 
mation, first with respect to simple expertise ratios 
and, second, with respect to hypergraphic repeti- 
tion rates. To this end, we simulate 2 500 instances



of above-defined null-model-based epistemic hyper- 
graphs, which are therefore random hypergraphs.^7 We
then compare the composition of teams thus obtained 
with that of the empirical data. 

Expertise ratio: socio-semantic homogene- 
ity/heterogeneity 

Distinguishing agents who have already been asso- 
ciated with a concept ("experts") and agents who 
are not yet associated ("neophytes"), we thus as- 
sess whether real teams involve agents of mixed back- 
grounds or not, relatively to a randomly-built set of 
teams. Details of this comparison are displayed on 
Fig. 2 for the zebrafish case, which illustrates the com- 
position of teams for various levels of expertise ratios, 
in both the real and random cases. Corresponding 
propensities, for both cases, are shown on Fig. 3: their 
shapes are consistent across all datasets and consist of
a U-shaped curve above 1 for extreme values of exper-
tise ratios (towards 0 and towards 1) and below 1 for
central values (typically, from 0.1 to ca. 0.4-0.5).

Empirically, we thus observe that there is a signif-
icantly high propensity of formation of teams com-
posed of either experts only or neophytes only, with a
significantly lower propensity for mixed teams. Teams
involving a mixed proportion of experts and neo-
phytes are thus less frequent than expected by chance.

Hypergraphic rate of repetition: social or se- 
mantic homogeneity/heterogeneity 

Measuring now propensities of group formation with 
respect to hypergraphic rates of repetition, we can 
empirically exhibit the existence and influence of an 
implicit group structure which drives recurrent team 
formation — this group structure exists along the two 
above-mentioned dimensions: 

• Social homogeneity/heterogeneity: With respect 
to agents, the hypergraphic rate of repetition 
measures the extent to which a team features 
repeated interactions among former collabora- 
tors. Once again, our results have to be com- 
pared to the null hypothesis for which teams are 
formed randomly. Figure 4-top features the cor-
responding propensities which are several orders
of magnitude higher than 1 for teams with a non-
negligible proportion of such repetitions (r > .1).

• Conceptual homogeneity/heterogeneity: Simi- 
larly, we measure the propensity of team forma- 
tion with respect to repeated concept associa- 

^7 For reasons of computational complexity, we consider event
sizes not greater than 10 agents and 10 concepts; with this
constraint we still consider no less than 89% of the total original
number of teams.
Figure 2: Probability distribution of the expertise ratio on all teams aggregated over all years and all concepts
(left: observed, right: theoretical). The computation of propensities below will be based on the ratio of such
observed distributions over theoretical ones.



tions, addressing the following issue: "are there 
cores of concepts which are likely to be recur- 
rently associated, given that they were previously 
jointly used in previous papers?" Results, shown 
on Fig. 4-bottom, demonstrate again (and even 
in a stronger fashion than in the social case) 
that there is a significant bias towards gathering 
groups of concepts which were previously associ- 
ated. 
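
To make the notion of hypergraphic repetition concrete, here is a
minimal sketch which assumes that the (unweighted) rate of a team is
the fraction of its subsets of size at least 2 that were already
contained in some earlier team. The function name and the toy teams
are hypothetical. The same function applies to concepts by passing
sets of concepts instead of sets of agents.

```python
from itertools import combinations


def repetition_rate(team, past_teams):
    """Fraction of subsets of `team` (size >= 2) found inside a past team.

    Assumption for this sketch: a subset counts as repeated when all of
    its members already appeared together in at least one earlier team.
    """
    members = sorted(set(team))
    past = [frozenset(p) for p in past_teams]
    repeated = 0
    total = 0
    for size in range(2, len(members) + 1):
        for subset in combinations(members, size):
            total += 1
            if any(frozenset(subset) <= p for p in past):
                repeated += 1
    # Teams with fewer than two members have no subset to repeat.
    return repeated / total if total else 0.0


past = [{"a", "b", "c"}, {"c", "d"}]
print(repetition_rate({"a", "b", "d"}, past))  # only {a, b} is repeated
print(repetition_rate({"a", "b", "c"}, past))  # every subset is repeated
```

The enumeration visits every subset of the team, so its cost grows
exponentially with team size, which is in line with restricting the
analysis to events of at most 10 agents and 10 concepts.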

4.2 Discussion of hypotheses 

It is now possible to review and check the
aforementioned hypotheses. As follows from Fig. 4, it
is clear that (H1) and (H2) are quantitatively
confirmed: teams with a high proportion of interaction
repetitions or with a high proportion of repeated
conceptual associations are much more likely than should
be expected by chance.

Additionally, and irrespective of the simulation 
model, we check if there is a correlation between se- 
mantic and social hypergraphic rates of repetition. As 
shown on Fig. 5, there seems to be no correlation be- 
tween social and semantic originality in a collabora- 
tion (in our datasets, which come from varied back- 
grounds but are also focused on particular epistemic 
communities). This invalidates (H3): contrary to
intuition, new semantic associations do
not stem more from original teams than from repeated
teams. In other words, semantic innovation is as likely
from agents who, globally, previously collaborated, as
from new collaborations.*
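
A check of this kind can be sketched as follows: teams are grouped
into bins of social repetition ratio, and the average semantic
repetition ratio is reported for each bin together with a 95%
confidence half-width. The normal approximation (factor 1.96), the
choice of six bins and all names are assumptions made for this
illustration; the text does not specify how the intervals were
obtained.

```python
import math
from statistics import mean, stdev


def binned_means(pairs, n_bins=6):
    """Average semantic ratio per bin of social ratio.

    `pairs` holds (social_ratio, semantic_ratio) tuples, both in [0, 1].
    Returns one (mean, half_width) tuple per bin, or None for empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for social, semantic in pairs:
        idx = min(int(social * n_bins), n_bins - 1)  # 1.0 goes to last bin
        bins[idx].append(semantic)
    summary = []
    for values in bins:
        if not values:
            summary.append(None)
        elif len(values) == 1:
            summary.append((values[0], 0.0))  # no spread to estimate
        else:
            # Normal-approximation 95% interval (an assumption here).
            half = 1.96 * stdev(values) / math.sqrt(len(values))
            summary.append((mean(values), half))
    return summary


# Toy pairs, invented for illustration.
toy = [(0.05, 0.2), (0.10, 0.4), (0.90, 0.3)]
for row in binned_means(toy):
    print(row)
```

If the per-bin averages stay within each other's intervals across the
whole range of social ratios, the data show no visible dependence of
semantic repetition on social repetition.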



*This does not mean, however, that the backgrounds of
previous collaborators who are causing semantic innovation should
necessarily be similar (semantic innovation might indeed come
from repeated collaboration with individuals who have varied
semantic backgrounds).

Figure 5: Average semantic hypergraphic repetition
ratio (y-axis) for a given range of social hypergraphic
repetition ratio (x-axis), with one curve per dataset
(legend entries include RABIES and ZEBRAFISH). The x-axis
bins are [0,0.16[, [0.16,0.33[, [0.33,0.5[, [0.5,0.66[,
[0.66,0.83[ and [0.83,1]. (Error bars correspond to
95% confidence intervals with respect to averages on
each repetition ratio bin (in abscissa), e.g.
[0,0.1[.)

As regards expertise, (HI), "teams gathering
around a given topic should involve more individuals
knowledgeable about it", is partially confirmed
and partially contradicted by the empirical evidence.
Firstly, teams with a high proportion of experts in a
concept involved in the collaboration are much more
likely, as shown on the right side of each graph on
Fig. 3, whose values are significantly above 1.
Yet, secondly, teams with a very small proportion
of experts regarding a concept, i.e. a high proportion
of neophytes, are also significantly more likely,
suggesting that part of the use of new concepts is also
due to teams almost completely new to such concepts
(even if, as established by (H1), these very teams are
still more likely to stem from repeated collaborations).
Put bluntly, new concept usage, and thus part of
innovation, appears to stem both from teams significantly
ignorant of such concepts and from teams globally
knowledgeable about such concepts.

Figure 3: Propensity for proportions of experts per article, from our real data vs expected from our random
theoretical model, averaged over all years, then over all concepts. (Error bars correspond to 95% confidence
intervals with respect to concept averages.)

From this observation that "all-experts" and
"all-neophytes" teams are more likely, we may expect that
such teams stem from underlying groups (either still
working on the same topic, or working on a new topic,
respectively) and thus have a higher social
hypergraphic repetition ratio. Similarly, those teams
stemming from underlying groups are likely to carry out
normal, specialized science and have a higher semantic
hypergraphic repetition ratio (or lower originality).
Figure 6 sheds light on these issues by comparing average
hypergraphic repetition ratios with expertise ratios.
In particular, we observe that teams with a balanced
composition of experts have a higher social originality
(lower social hypergraphic repetition ratio), yet
semantic originality remains constant across various
values of expertise ratios. This partially confirms (HII)
as regards social originality and partially invalidates
it as regards semantic originality: social originality
is increased when there is a mixed proportion
of experts, but semantic originality is not.



Figure 4: Propensity of team formation (random hypergraph vs. real data) with respect to hypergraphic
repetition ratios for agents (top) and concepts (bottom). (Values are averaged over all years, error bars
correspond to 95% confidence intervals with respect to these averages.)

Figure 6: Average hypergraphic repetition ratios (y-axis) with respect to expertise ratios (x-axis): social (dashed
line) and semantic (solid line) cases. (Error bars correspond to 95% confidence intervals with respect to averages
on each expertise ratio bin (in abscissa), e.g. [0,0.1[.)



5 Concluding remarks

We presented a formal framework to appraise the
underpinnings of collaboration formation with a
hypergraphic approach which encompasses both the
meso-level of teams and the joint dynamics of social and
semantic features. This allowed the quantitative
estimation of the relative strength of social and semantic
patterns behind academic team formation, by
empirically studying several communities of scientists and
estimating how the composition of teams, both
cognitively and socially, diverges from a null hypothesis
where collaborators and/or topics would be randomly
chosen.

We could thereby confirm some hypotheses and
invalidate others which had been established
in a relatively qualitative fashion in the
literature, or in a possibly misleading dyadic form. More
precisely, our measurements suggest a mechanism of
team formation based on (i) a high likelihood of repeating
previous collaboration patterns, not only dyadic but
also n-adic interactions (n ≥ 3), and (ii) a noticeable
confinement of groups of individuals, whose
collaborations appear to depend largely on the history of
team memberships, and, similarly, a noticeable semantic
confinement where associations of concepts depend
largely on the repetition of previous associations. On
the whole, however, the originality of a paper does not
seem to stem from an original composition of the
underlying team, while a polarization appears between
groups made of experts only or made of non-experts
only, which altogether correspond to collectives
exhibiting a high rate of repeated interactions.

Perspectives on models of academic collaboration.
Taking into account an implicit group structure, both
at a social and at a socio-semantic level, as evidenced
by the data, is likely to yield a more faithful account
of the structure of academic collaboration networks.
Indeed, the underlying low-level dynamics is plausibly
closer to hypergraphic team formation mechanisms than would
be allowed by a design based on dyadic interactions
only. As noted earlier, this should not come at the
expense of organizational thinking regarding the
underpinnings of scientific production: beyond the step
that constitutes our present contribution, an exhaustive
approach to this type of collaboration mechanism
would indeed have to involve both epistemic
hypergraphs and organizational features. In this respect,
while we claim and show that hypergraphs make
it possible to capture some interesting processes of
team-based, knowledge-intensive production systems,
we also emphasize that the richness of organizational
mechanisms should not be overshadowed by this
formalism.

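In line with our results, it should also be possible
to determine which features, at the team level, favor
better collaborations, not only in terms of semantic
originality, but also in terms of quality and creativity
of output, in a broad sense.

A Weighting functions

A weighted hypergraphic repetition rate could be
written as follows:

\[
r_w(e) = \frac{\displaystyle\sum_{e' \subseteq e,\ |e'| \geq 2} w_e(|e'|)\, \rho(e')}
              {\displaystyle\sum_{i \in \{2, \dots, |e|\}} \binom{|e|}{i}\, w_e(i)}
\]

where $w_e$ is a weight function (given $e$, $w_e : \mathbb{N} \to \mathbb{R}$)
which makes it possible to give more or less weight to
particular subset sizes, and where $\rho(e') \in \{0, 1\}$
indicates whether the subset $e'$ already appeared within a
previous team.
For instance:

• taking $w_e(i) = 1$, i.e. actually no weighting as
has been used in the paper,

\[
r_w(e) = \frac{1}{2^{|e|} - |e| - 1} \sum_{e' \subseteq e,\ |e'| \geq 2} \rho(e')
\]

• if instead $w_e(i) = i$, i.e. weighting proportional
to the size of the considered subset,

\[
r_w(e) = \frac{1}{|e|\,(2^{|e|-1} - 1)} \sum_{e' \subseteq e,\ |e'| \geq 2} |e'|\, \rho(e')
\]

• if finally $w_e(i) = \binom{|e|}{i}$, i.e. weighting
proportional to the number of possible subsets of size $i$
in a set of size $|e|$,

\[
r_w(e) = \frac{1}{\sum_{i=2}^{|e|} \binom{|e|}{i}^{2}} \sum_{e' \subseteq e,\ |e'| \geq 2} \binom{|e|}{|e'|}\, \rho(e')
\]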
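
The three weightings of this appendix can be compared with a small
sketch. It assumes that the weighted rate is the weighted share of
repeated subsets of size at least 2, normalized by the total weight
of all such subsets, and that a subset is repeated when it was
contained in an earlier team. The function names and the toy teams
are hypothetical.

```python
from itertools import combinations
from math import comb


def weighted_repetition_rate(team, past_teams, weight):
    """Weighted share of repeated subsets (size >= 2) of `team`.

    `weight(i, n)` gives the weight of a subset of size i in a team of
    size n; the denominator sums comb(n, i) * weight(i, n) over i >= 2.
    """
    members = sorted(set(team))
    n = len(members)
    past = [frozenset(p) for p in past_teams]
    numerator = 0.0
    denominator = 0.0
    for i in range(2, n + 1):
        w = weight(i, n)
        denominator += comb(n, i) * w
        for subset in combinations(members, i):
            if any(frozenset(subset) <= p for p in past):
                numerator += w
    return numerator / denominator if denominator else 0.0


def uniform(i, n):
    """No weighting: every subset counts once."""
    return 1


def by_size(i, n):
    """Weight proportional to the subset size."""
    return i


def by_count(i, n):
    """Weight proportional to the number of subsets of size i."""
    return comb(n, i)


past = [{"a", "b", "c"}, {"c", "d"}]
team = {"a", "b", "d"}
for weight in (uniform, by_size, by_count):
    print(weight.__name__, weighted_repetition_rate(team, past, weight))
```

For this team of three agents, only the pair {a, b} is repeated, so
the three rates are 1/4, 2/9 and 3/10 respectively; the denominator 9
in the second case equals |e|(2^(|e|-1) - 1) with |e| = 3.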
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